Abstract: We consider DNA codes based on the nearest-neighbor
(stem) similarity model, which adequately reflects the
"hybridization potential" of two DNA sequences. Our aim is to
present a survey of bounds on the rate of DNA codes with respect
to a thermodynamically motivated similarity measure called an
additive stem similarity. These results yield a method to analyze
and compare known samples of the nearest-neighbor "thermodynamic
weights" associated with stacked pairs that occur in
DNA secondary structures.

I. Introduction 

Single strands of DNA are represented by oriented sequences
with elements from the alphabet $\mathcal{A} \triangleq \{A, C, G, T\}$.
The reverse complement (Watson-Crick transformation) of a
DNA strand is defined by first reversing the order of the
letters and then substituting each letter $x$ for its complement
$\bar{x}$, namely: $A$ for $T$, $C$ for $G$ and vice versa. For example,
the reverse complement of AACG is CGTT. For a strand
$x = (x_1 x_2 \dots x_{n-1} x_n) \in \mathcal{A}^n = \{A, C, G, T\}^n$, let

$$\tilde{x} \triangleq (\bar{x}_n \bar{x}_{n-1} \dots \bar{x}_2 \bar{x}_1) \in \mathcal{A}^n = \{A, C, G, T\}^n \qquad (1)$$

denote its reverse complement. If $y = \tilde{x}$, then $x = \tilde{y}$ for
any $x \in \mathcal{A}^n$. If $x = \tilde{x}$, then $x$ is called a self reverse
complementary sequence. If $x \ne \tilde{x}$, then the pair $(x, \tilde{x})$ is
called a pair of mutually reverse complementary sequences.
A (perfect) Watson-Crick duplex is the joining of oppositely
directed $x$ and $\tilde{x}$ so that every letter of one strand is paired
with its complementary letter on the other strand in the double
helix structure, i.e., $x$ and $\tilde{x}$ are "perfectly compatible."
However, when two, not necessarily complementary, oppositely
directed DNA strands are "sufficiently compatible," they too
are capable of coalescing into a double stranded DNA duplex.
The process of forming DNA duplexes from single strands is
referred to as DNA hybridization. Crosshybridization occurs
when two oppositely directed and non-complementary DNA
strands form a duplex.
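
As a small illustration of the Watson-Crick transformation defined above, the following Python sketch computes the reverse complement of a strand and tests whether a strand is self reverse complementary. The function names are chosen here for illustration only and do not come from the cited works.

```python
# Watson-Crick complement of each letter of the alphabet {A, C, G, T}.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(strand: str) -> str:
    """Reverse the strand, then replace every letter by its complement."""
    return "".join(COMPLEMENT[letter] for letter in reversed(strand))


def is_self_reverse_complementary(strand: str) -> bool:
    """True if the strand coincides with its own reverse complement."""
    return strand == reverse_complement(strand)


print(reverse_complement("AACG"))             # CGTT, the example from the text
print(is_self_reverse_complementary("ACGT"))  # True
```

Applying the transformation twice returns the original strand, which is the statement "if $y = \tilde{x}$, then $x = \tilde{y}$" from the text.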

In general, crosshybridization is undesirable as it usually
leads to experimental error. To increase the accuracy and
throughput of the applications listed in [1]-[5], there is a desire
to have collections of DNA strands, as large and as mutually
incompatible as possible, so that no crosshybridization can
take place. It is straightforward to view this problem as one
of coding theory [6].

DNA nanotechnology often requires collections of DNA
strands, called free energy gap codes [7], that will correctly
"self-assemble" into Watson-Crick duplexes and do not produce
erroneous crosshybridizations. When these collections
consist entirely of pairs of mutually reverse complementary
DNA strands, they are called DNA tag-antitag systems [?] and
DNA codes [7]-[13].

The best biological model known to date, commonly used to
estimate hybridization energy, is the "nearest-neighbor"
similarity model introduced in [1]. Roughly, it
implies that the hybridization energy of any two DNA strands
should be calculated as a sum of thermodynamic weights of all
stems that were formed in the process of hybridization. A stem
is defined as a pair of consecutive DNA letters of one
strand that coalesced with a pair of consecutive DNA
letters of the other strand. This biological model leads
to a special similarity function on the space $\mathcal{A}^n$.

The first constructions of DNA codes known to the authors were
suggested in [9]-[10]. They were based on conventional Hamming
distance codes. Some methods of combinatorial coding
theory have been developed [14]-[15] as a means by which
such DNA codes can be found. From the very beginning it
was understood that the hybridization energy of DNA strands
should be modeled by a similarity function
for sequences from $\mathcal{A}^n$. However, Hamming similarity
does not adequately capture the idea of the
"nearest-neighbor" similarity model. It is therefore no
wonder that further research focused primarily on
the search for an appropriate similarity function.

One example of such a function was proposed in [16], where
it was calculated as the sum of weights of all elements
constituting the longest common Hamming subsequence. Later
attempts included deletion similarity [8], which was introduced
earlier by Levenshtein [17], and block similarity [12]-[13].
Both functions are non-additive, which allowed for the
consideration of such cases as shifts of DNA sequences along each
other. Nevertheless, none of them captured the essence of the
"nearest-neighbor" similarity model.



In 2008 we published our first work [18] devoted to the
study of stem similarity functions. There we considered the
simplest case, when the similarity between two sequences from
$\mathcal{A}^n$ is equal to the number of stems in the longest common
Hamming subsequence between these two sequences. A
common stem is understood as a block of length 2 which
contains two adjacent elements of both of the initial sequences.

In [19], we introduced the concept of an additive stem
$w$-similarity for an arbitrary weight function $w = w(a, b) > 0$,
defined for all 16 elements $(ab) \in \mathcal{A}^2$, called stems. To
calculate the additive stem $w$-similarity between two DNA
sequences, one adds up the weights of all stems in the
longest common Hamming subsequence between them (see
Definition 1 below). Finally, our recent works [20]-[21] deal
with a non-additive stem $w$-similarity function, previously
introduced in [7]. That model also counts the
weights of all stems formed between two DNA sequences,
with the only difference that these stems are contained not in
a common Hamming subsequence but in a common subsequence in
the sense of the Levenshtein insertion-deletion metric. For a more
detailed discussion of the applicability of the proposed
constructions to modeling DNA hybridization assays, see [3].

In this report we summarize the main results of [19] on
the asymptotic behavior of the maximal size of DNA codes for
the additive stem $w$-similarity function. We show how these
results lead to a possible criterion, called the
critical relative $w$-distance of DNA codes, for distinguishing
between weight samples $w(a, b)$ found in different experiments.
We also explain how our considerations suggest
algorithms for composing DNA ensembles of optimal size
for a given length of DNA strands.

II. Additive Stem w-Similarity Model

A. Notations and Definitions 

The symbol $\triangleq$ denotes definitional equalities and the symbol
$[n] \triangleq \{1, 2, \dots, n\}$ denotes the set of integers from 1 to $n$.
Let $w = w(a, b) > 0$, $a, b \in \mathcal{A}$, be a weight function such
that

$$w(a, b) = w(\bar{b}, \bar{a}), \quad a, b \in \mathcal{A}. \qquad (2)$$

Condition (2) means that $w(a, b)$ is invariant under the
Watson-Crick transformation.

Definition 1: [7], [19]. For $x, y \in \mathcal{A}^n$, the number

$$S_w(x, y) \triangleq \sum_{i=1}^{n-1} s_i^w(x, y), \quad \text{where} \quad
s_i^w(x, y) \triangleq
\begin{cases}
w(a, b) & \text{if } x_i = y_i = a,\ x_{i+1} = y_{i+1} = b, \\
0 & \text{otherwise},
\end{cases} \qquad (3)$$

is called an additive stem $w$-similarity between $x$ and $y$.

Function $S_w(x, y)$ is used to model a thermodynamic similarity
(hybridization energy) between DNA sequences $x$ and $y$.
In virtue of (2)-(3), the function satisfies

$$S_w(x, y) = S_w(y, x) \le S_w(x, x), \quad x, y \in \mathcal{A}^n. \qquad (4)$$

In addition,

$$S_w(x, y) = S_w(\tilde{y}, \tilde{x}), \quad x, y \in \mathcal{A}^n. \qquad (5)$$

Identity (5) implies the symmetry property of hybridization
energy between DNA sequences $x$ and $y$ [7]-[13].
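
The additive stem $w$-similarity of Definition 1 can be sketched in a few lines of Python. The sketch below is illustrative only: the function names and the two sample strands are assumptions introduced here, and the weight function is passed as an ordinary Python callable.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(x):
    return "".join(COMPLEMENT[c] for c in reversed(x))


def stem_similarity(x, y, w):
    """Additive stem w-similarity (3): add w(a, b) for every position i at which
    both strands carry the same stem, x_i = y_i = a and x_{i+1} = y_{i+1} = b."""
    if len(x) != len(y):
        raise ValueError("strands must have the same length")
    return sum(
        w(x[i], x[i + 1])
        for i in range(len(x) - 1)
        if x[i] == y[i] and x[i + 1] == y[i + 1]
    )


def unit_weight(a, b):
    """Constant weights w(a, b) = 1."""
    return 1


# Two hypothetical strands of length 5 sharing the stems (AC) and (CG).
x, y = "ACGTA", "ACGAA"
s_xy = stem_similarity(x, y, unit_weight)
s_xx = stem_similarity(x, x, unit_weight)
print(s_xy, s_xx)  # 2 4
```

For these strands the sketch reproduces property (4), since the similarity is symmetric and never exceeds $S_w(x, x)$, and property (5), since the reverse complements share the same number of common stems.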

Example 1: In [18] we considered constant weights $w =
w(a, b) \equiv 1$, $a, b \in \mathcal{A}$, for which the additive stem 1-similarity
$S_1(x, y)$, $0 \le S_1(x, y) \le S_1(x, x) = n - 1$, is the above-mentioned
number of stems in the longest common Hamming
subsequence between $x$ and $y$.

Example 2: Table 1 shows a biologically motivated collection
of weights $w(a, b) = U(a, b)$, called unified weights [2]:

U(a,b)          b = A   b = C   b = G   b = T
a = A           1.00    1.44    1.28    0.88
a = C           1.45    1.84    2.17    1.28
a = G           1.30    2.24    1.84    1.44
a = T           0.58    1.30    1.45    1.00

Table 1: Unified weights U(a, b), 1998.

The given values $U(a, b)$ are based on weight samples which
come from [2] and [5] and are the nearest-neighbor "thermodynamic
weights" (e.g., free energy of formation) associated
with stacked pairs that occur in DNA secondary structures.
See [3] for an introduction to the nearest-neighbor model.
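
As a sanity check on Table 1, the following Python sketch stores the unified weights as a dictionary and verifies the Watson-Crick invariance (2), that is, $U(a, b) = U(\bar{b}, \bar{a})$. The data layout and helper names are assumptions made here for illustration.

```python
LETTERS = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Rows are indexed by a, columns by b in the order A, C, G, T (values of Table 1).
U_ROWS = {
    "A": [1.00, 1.44, 1.28, 0.88],
    "C": [1.45, 1.84, 2.17, 1.28],
    "G": [1.30, 2.24, 1.84, 1.44],
    "T": [0.58, 1.30, 1.45, 1.00],
}
U = {(a, b): U_ROWS[a][j] for a in LETTERS for j, b in enumerate(LETTERS)}


def is_watson_crick_invariant(w):
    """Check condition (2): w(a, b) = w(complement(b), complement(a))."""
    return all(
        w[(a, b)] == w[(COMPLEMENT[b], COMPLEMENT[a])]
        for a in LETTERS
        for b in LETTERS
    )


def self_similarity(x, w):
    """S_w(x, x): the sum of the weights of all stems of x."""
    return sum(w[(x[i], x[i + 1])] for i in range(len(x) - 1))


print(is_watson_crick_invariant(U))  # True
```

Because of this invariance, a strand and its reverse complement have the same self-similarity; for example, AACG and CGTT both give 1.00 + 1.44 + 2.17, up to floating-point rounding.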

Taking into account inequality (4), we give

Definition 2: [7], [19]. The number

$$\mathcal{D}_w(x, y) \triangleq S_w(x, x) - S_w(x, y) = \sum_{i=1}^{n-1} r_i^w(x, y), \qquad
r_i^w(x, y) \triangleq s_i^w(x, x) - s_i^w(x, y) \ge 0, \qquad (6)$$

is called an additive stem $w$-distance between $x, y \in \mathcal{A}^n$.

Let $x(j) \triangleq (x_1(j) x_2(j) \dots x_n(j)) \in \mathcal{A}^n$, $j \in [N]$, be
codewords of a $q$-ary code $X = \{x(1), x(2), \dots, x(N)\}$ of
length $n$ and size $N$, where $N = 2, 4, \dots$ is an even number.
Let $D$, $0 < D \le \max_x S_w(x, x)$, be an arbitrary positive
number.

Definition 3: [7], [19]. A code $X$ is called a DNA code of
distance $D$ for additive stem $w$-similarity (3) (or an $(n, D)_w$-code)
if the following two conditions are fulfilled. (i) For
any integer $j \in [N]$, there exists $j' \in [N]$, $j' \ne j$, such that
$x(j') = \tilde{x}(j) \ne x(j)$. In other words, $X$ is a collection of $N/2$
pairs of mutually reverse complementary sequences. (ii) The
minimal $w$-distance of code $X$ is

$$\mathcal{D}_w(X) \triangleq \min_{j \ne j'} \mathcal{D}_w(x(j), x(j')) \ge D. \qquad (7)$$

Let $N_w(n, D)$ be the maximal size of DNA $(n, D)_w$-codes
for the additive stem $w$-distance (6). If $d > 0$ is a fixed number, then

$$R_w(d) \triangleq \overline{\lim_{n \to \infty}} \frac{\log_4 N_w(n, nd)}{n} \qquad (8)$$

is called a rate of DNA $(n, nd)_w$-codes for the relative
distance $d > 0$.
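
Conditions (i) and (ii) of Definition 3 can be checked directly for a small code. The Python sketch below is a brute-force illustration under the assumption that the code is given as a list of equal-length strings; the helper names and the four sample codewords are hypothetical. Since $\mathcal{D}_w(x, y)$ is not symmetric in general, the minimum in (7) is taken over ordered pairs.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(x):
    return "".join(COMPLEMENT[c] for c in reversed(x))


def similarity(x, y, w):
    """Additive stem w-similarity (3)."""
    return sum(
        w(x[i], x[i + 1])
        for i in range(len(x) - 1)
        if x[i] == y[i] and x[i + 1] == y[i + 1]
    )


def distance(x, y, w):
    """Additive stem w-distance (6)."""
    return similarity(x, x, w) - similarity(x, y, w)


def is_dna_code(code, D, w):
    """Brute-force check of conditions (i) and (ii) of Definition 3."""
    words = set(code)
    # (i) every codeword has its reverse complement in the code, distinct from itself.
    for x in words:
        rc = reverse_complement(x)
        if rc == x or rc not in words:
            return False
    # (ii) every ordered pair of distinct codewords is at w-distance at least D.
    return all(distance(x, y, w) >= D for x in words for y in words if x != y)


def unit_weight(a, b):
    return 1


sample = ["AAAAA", "TTTTT", "ACACA", "TGTGT"]
print(is_dna_code(sample, 4, unit_weight))  # True
```

The check is quadratic in the code size, which is adequate for small examples but not intended as an efficient search procedure.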



B. Construction 

Theorem 1: If $n = 2t + 1$, $t = 1, 2, \dots$, then

$$N_1(n, n - 1) = 16.$$

Proof: Codewords of an $(n, n - 1)_1$-code should not contain
any common stems with each other. Note that $|\mathcal{A}^2| = 16$
and hence for any $(n, n - 1)_1$-code $X = \{x(1), \dots, x(N)\}$

$$N = |\{(x_1(u) x_2(u)),\ u \in [N]\}| \le 16.$$

Thus,

$$N_1(n, n - 1) \le 16.$$

Obviously, for odd $n$, the set $\mathcal{A}^n$ doesn't contain self reverse
complementary words. For a stem $\mathbf{a} = (a_1 a_2) \in \mathcal{A}^2$, define
$x(\mathbf{a}) \triangleq (a_1 a_2 a_1 a_2 \dots a_2 a_1 a_2 a_1) \in \mathcal{A}^n$. The code

$$X_r \triangleq \{x(\mathbf{a}),\ \mathbf{a} \in \mathcal{A}^2\}, \qquad |X_r| = 4^2 = 16,$$

constitutes a DNA $(n, n - 1)_1$-code of size 16 for additive stem
1-similarity. Theorem 1 is proved. ■

Example 3: For instance, if $n = 5$, $D = n - 1 = 4$, then the 8
pairs of mutually reverse complementary codewords of code
$X_r$ are:

(AAAAA, TTTTT), (ACACA, TGTGT),
(AGAGA, TCTCT), (ATATA, TATAT),
(CACAC, GTGTG), (CCCCC, GGGGG),
(CGCGC, GCGCG), (CTCTC, GAGAG).
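
The construction used in the proof of Theorem 1 is easy to reproduce. The following Python sketch builds the 16 alternating codewords $x(\mathbf{a})$ for odd $n$ and checks that they split into 8 reverse complementary pairs with no common stems between distinct codewords. The function names are illustrative assumptions.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(x):
    return "".join(COMPLEMENT[c] for c in reversed(x))


def build_code(n):
    """Codewords x(a) = (a1 a2 a1 a2 ... a1) for all 16 stems a = (a1 a2), n odd."""
    if n % 2 == 0:
        raise ValueError("the construction requires odd n")
    return [(a1 + a2) * (n // 2) + a1 for a1, a2 in product("ACGT", repeat=2)]


def common_stems(x, y):
    """Number of positions at which x and y carry the same stem."""
    return sum(1 for i in range(len(x) - 1) if x[i:i + 2] == y[i:i + 2])


code = build_code(5)
# Each unordered pair {x, reverse complement of x} is listed once.
pairs = sorted({tuple(sorted((x, reverse_complement(x)))) for x in code})
for pair in pairs:
    print(pair)
```

For $n = 5$ the printed pairs coincide with the 8 pairs of Example 3.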

Remark 1: Note that for any weight function $w$, the additive
stem $w$-similarity $S_w(x(\mathbf{a}), x(\mathbf{b})) = 0$, $\mathbf{a}, \mathbf{b} \in \mathcal{A}^2$, $\mathbf{a} \ne \mathbf{b}$.
Hence, the minimal $w$-distance (7) of code $X_r$ is

$$\mathcal{D}_w(X_r) = \min_j S_w(x(j), x(j)) \ge 2t \cdot \underline{w},$$

where $\underline{w} \triangleq \min_{a, b \in \mathcal{A}} w(a, b)$. Thus, for any weight function $w$,
the code $X_r$ is also an $(n, (n - 1) \cdot \underline{w})_w$-code. For example,
for the additive stem $U$-similarity of Example 2, the number
$\mathcal{D}_U(X_r) = 2t$. Therefore, the code $X_r$ is an $(n, n - 1)_U$-code.

C. Bounds on Rate $R_w(d)$

Let $p \triangleq \{p(a, b),\ a, b \in \mathcal{A}\}$ be an arbitrary joint probability
distribution on the set of stems $(ab) \in \mathcal{A}^2$, i.e.,

$$\sum_{a, b \in \mathcal{A}} p(a, b) = 1, \qquad p(a, b) \ge 0 \ \text{ for any } a, b \in \mathcal{A}.$$

To describe bounds on the rate $R_w(d)$, we will consider
joint probability distributions $p$ such that the corresponding
marginal probabilities coincide, i.e., for any $a \in \mathcal{A}$

$$p_1(a) \triangleq \sum_{b \in \mathcal{A}} p(a, b) = \sum_{b \in \mathcal{A}} p(b, a) \triangleq p_2(a) > 0 \qquad (9)$$

and, in addition, the function $p(a, b)$, as well as the weight
function (2), is invariant under the Watson-Crick transformation, i.e.,

$$p(a, b) = p(\bar{b}, \bar{a}) \ \text{ for any } a, b \in \mathcal{A}. \qquad (10)$$

Let

$$p_1(b|a) \triangleq \frac{p(a, b)}{p_1(a)}, \qquad p_2(b|a) \triangleq \frac{p(b, a)}{p_2(a)}$$

denote the corresponding conditional probabilities. It is easy
to check that for distributions $p$ with properties (9)-(10), and
for the corresponding conditional probabilities, the following
equalities hold true for any $a, b \in \mathcal{A}$:

$$p_1(a) = p_2(a) = p_1(\bar{a}) = p_2(\bar{a}), \qquad p_1(b|a) = p_2(\bar{b}|\bar{a}). \qquad (11)$$

For a fixed weight function (2), introduce the values

$$T_w \triangleq \max_p T_w(p), \qquad
T_w(p) \triangleq \sum_{a, b \in \mathcal{A}} \left( p(a, b) - p^2(a, b) \right) w(a, b), \qquad (12)$$

where the maximum is taken over all distributions $p$ for
which condition (9) holds true. Note that if the weight function is
invariant under the Watson-Crick transformation, then the maximizing
distribution of (12) will satisfy conditions (10)-(11).

Applying an analog of the conventional Plotkin bound [6],
one can prove

Theorem 2: [19] If $d \ge T_w$, then $R_w(d) = 0$.

Let $x = (x_1 x_2 \dots x_n) \in \mathcal{A}^n$ be the stationary Markov chain
with initial distribution $p_1(a)$, $a \in \mathcal{A}$, and transition matrix
$P = \|p_1(b|a)\|$, $a, b \in \mathcal{A}$, i.e.,

$$\Pr\{x_1 = a\} \triangleq p_1(a), \qquad \Pr\{x_{i+1} = b \mid x_i = a\} \triangleq p_1(b|a) \qquad (13)$$

for any $a, b \in \mathcal{A}$ and $i \in [n - 1]$.

Let a distribution $p$ satisfy (9) and let also the following
Markov condition $\mathcal{M}$ be fulfilled: the transition matrix $P$ must
define a Markov chain $x = (x_1 x_2 \dots x_n)$ such that for any pair
of states $a, b \in \mathcal{A}$ there exists an integer $m \in [4]$ such that
the conditional probability $\Pr\{x_{m+1} = b \mid x_1 = a\} > 0$.

Theorem 3: [19] For any probability distribution $p$ satisfying
condition (9) and Markov condition $\mathcal{M}$, and any relative
distance $d$, $0 < d < T_w(p)$, the rate $R_w(d) > 0$.

Theorem 3 is established using the ensemble of random
codes where independent codewords $x = (x_1 x_2 \dots x_n)$
are identically distributed in accordance with the Markov
chain (13) and, in virtue of (11), the corresponding reverse
complement codewords $\tilde{x} = (\bar{x}_n \bar{x}_{n-1} \dots \bar{x}_2 \bar{x}_1)$ have the same
distribution (13) as well. In addition, the proof of Theorem 3
is based on the Perron-Frobenius theorem (see [22],
Theorem 3.1.1).

Let $T_w(p)$ be defined by (12) and

$$\overline{T}_w \triangleq \max T_w(p), \qquad (14)$$

where the maximum is taken over the distributions $p$ that satisfy
condition (9) and Markov condition $\mathcal{M}$.

If $\overline{T}_w = T_w$, then the corresponding weight function
$w = w(a, b)$ is called regular, and non-regular otherwise. If a
weight function $w = w(a, b)$ is regular, then $T_w$ is called the
critical relative distance of $(n, dn)_w$-codes.

From Theorems 2 and 3 it follows

Corollary 1: [19] If a weight function $w = w(a, b)$ is
regular, then the maximal size of $(n, nd)_w$-codes increases
exponentially with increasing $n$ if and only if $0 < d < T_w$.

Remark 2: The result of Theorem 3 suggests that the
construction of optimal random DNA codes for additive stem
$w$-similarity should be based on the generation of independent
Markov chains with transition matrix $P$ and initial distribution
$p_1(a)$ such that the corresponding distribution $p$ attains the maximum
in (12).
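
The idea of Remark 2 can be illustrated with a short Python sketch that generates codewords as realizations of the stationary Markov chain (13) and pairs each of them with its reverse complement. This is only a sketch under stated assumptions: the joint distribution is taken from Table 1' of Section III, the helper names and the seed are hypothetical, and the sampled ensemble is not checked against any target distance.

```python
import random

LETTERS = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Joint probabilities p(a, b) of Table 1' (rows a, columns b in the order A, C, G, T).
P_ROWS = {
    "A": [0.0, 0.0589, 0.0081, 0.0],
    "C": [0.0610, 0.1544, 0.2095, 0.0081],
    "G": [0.0060, 0.2136, 0.1544, 0.0589],
    "T": [0.0, 0.0060, 0.0610, 0.0],
}


def reverse_complement(x):
    return "".join(COMPLEMENT[c] for c in reversed(x))


def sample_codeword(n, rng):
    """Draw x_1 from the marginal p_1(a), then each next letter from p_1(b | a).
    random.choices rescales the weights, so rows of p(a, b) can be used directly."""
    marginal = [sum(P_ROWS[a]) for a in LETTERS]
    word = rng.choices(LETTERS, weights=marginal)[0]
    while len(word) < n:
        word += rng.choices(LETTERS, weights=P_ROWS[word[-1]])[0]
    return word


def sample_code(num_pairs, n, seed=0):
    """Hypothetical helper: a random ensemble of reverse complementary pairs."""
    rng = random.Random(seed)
    code = []
    for _ in range(num_pairs):
        x = sample_codeword(n, rng)
        code.extend([x, reverse_complement(x)])
    return code


code = sample_code(num_pairs=4, n=21)
```

Stems with $p(a, b) = 0$ never appear in the sampled codewords, and by the invariance (10) they do not appear in the reverse complements either. A practical construction would additionally discard codewords that violate the required minimal $w$-distance.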



III. Weight Sample Analysis Based on Criterion of Critical Relative Distance

In this section, we will discuss samples of the weight function
(or, briefly, weight samples) $w = w(a, b)$, $a, b \in \mathcal{A}$, taken
from SantaLucia (1998) (see Table 1 in [2]). In Tables 2-8, we
present weights $w(A, A) = w(T, T)$ and samples of relative
weights $\tilde{w}(a, b)$ with respect to $w(A, A)$, i.e., for any $a, b \in \mathcal{A}$,

$$\tilde{w} = \tilde{w}(a, b) \triangleq \frac{w(a, b)}{w(A, A)}, \qquad \tilde{w}(a, b) = \tilde{w}(\bar{b}, \bar{a}). \qquad (15)$$

The dimensionless numbers $\tilde{w}(a, b)$ are convenient for mutual comparison
and for comparison with the unified weights of Table 1.



w(A,A) = 0.43   b = A   b = C   b = G   b = T
a = A           1.00    2.28    1.93    0.63
a = C           2.32    2.84    3.95    1.93
a = G           2.16    3.81    2.84    2.28
a = T           0.51    2.16    2.32    1.00

Table 2: Gotoh, 1981.



w(A,A) = 0.89   b = A   b = C   b = G   b = T
a = A           1.00    1.35    1.52    0.91
a = C           1.54    1.84    2.24    1.52
a = G           1.40    2.20    1.84    1.35
a = T           0.85    1.40    1.54    1.00

Table 3: Vologodskii, 1984.



w(A,A) = 0.67   b = A   b = C   b = G   b = T
a = A           1.00    1.69    1.75    0.93
a = C           1.78    2.31    2.79    1.75
a = G           1.67    2.76    2.31    1.69
a = T           1.04    1.67    1.78    1.00

Table 4: Blake, 1991.


w(A,A) = 0.93   b = A   b = C   b = G   b = T
a = A           1.00    1.63    1.11    0.89
a = C           1.35    1.80    1.77    1.11
a = G           1.68    2.62    1.80    1.63
a = T           0.75    1.68    1.35    1.00

Table 5: Benight, 1992.



w(A,A) = 1.02   b = A   b = C   b = G   b = T
a = A           1.00    1.40    1.14    0.72
a = C           1.35    1.74    2.05    1.14
a = G           1.43    2.24    1.74    1.40
a = T           0.59    1.43    1.35    1.00

Table 6: SantaLucia, 1996.



w(A,A) = 1.20   b = A   b = C   b = G   b = T
a = A           1.00    1.25    1.25    0.75
a = C           1.42    1.75    2.33    1.25
a = G           1.25    1.92    1.75    1.25
a = T           0.75    1.25    1.42    1.00

Table 7: Sugimoto, 1996.



w(A,A) = 1.66   b = A   b = C   b = G   b = T
a = A           1.00    0.68    0.81    0.72
a = C           1.08    1.66    1.98    0.81
a = G           0.85    1.70    1.66    0.68
a = T           0.46    0.85    1.08    1.00

Table 8: Breslauer, 1986.



A. Analysis of Tables 1-8 for Additive w-Distance

Analysis of Table 1 and Tables 3-7: The given weight
samples are regular and the maximum in (12) is attained
when $p(a, b) = 0$ if stem $(ab) \in L_4$, where the set $L_4$ of
forbidden stems in the Markov chain (13) maximizing (12)
has the form

$$L_4 \triangleq \{(AT), (TA), (AA), (TT)\}. \qquad (16)$$

Below, in Table 1' and Tables 3'-7', we present the estimated
values of joint probabilities $p(a, b)$ and marginal probabilities
$p_1(a)$ for which the maximum in (12) is attained. Values of
the critical relative distance $T_{\tilde{w}}$ are given as well.



p(a,b)    b = A    b = C    b = G    b = T    p_1(a)
a = A     0        .0589    .0081    0        .067
a = C     .0610    .1544    .2095    .0081    .433
a = G     .0060    .2136    .1544    .0589    .433
a = T     0        .0060    .0610    0        .067

Table 1': Unified weights U(a, b). $T_U = 1.58$.
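
The value $T_U$ reported with Table 1' can be reproduced from formula (12). The following Python sketch evaluates $T_w(p) = \sum_{a,b} (p(a,b) - p^2(a,b))\, w(a,b)$ for the unified weights of Table 1 and the joint probabilities of Table 1', and also compares the two marginals required by condition (9). The function names and the rounding tolerances are assumptions made here, since the tabulated probabilities are rounded estimates.

```python
LETTERS = "ACGT"

# Unified weights U(a, b) of Table 1 (rows a, columns b in the order A, C, G, T).
U_ROWS = {
    "A": [1.00, 1.44, 1.28, 0.88],
    "C": [1.45, 1.84, 2.17, 1.28],
    "G": [1.30, 2.24, 1.84, 1.44],
    "T": [0.58, 1.30, 1.45, 1.00],
}

# Joint probabilities p(a, b) of Table 1'.
P_ROWS = {
    "A": [0.0, 0.0589, 0.0081, 0.0],
    "C": [0.0610, 0.1544, 0.2095, 0.0081],
    "G": [0.0060, 0.2136, 0.1544, 0.0589],
    "T": [0.0, 0.0060, 0.0610, 0.0],
}


def critical_value(p_rows, w_rows):
    """T_w(p): sum over all stems of (p(a,b) - p(a,b)^2) * w(a,b), as in (12)."""
    return sum(
        (p - p * p) * w
        for a in LETTERS
        for p, w in zip(p_rows[a], w_rows[a])
    )


def marginals(p_rows):
    """Row sums p_1(a) and column sums p_2(a) of the joint distribution."""
    p1 = {a: sum(p_rows[a]) for a in LETTERS}
    p2 = {a: sum(p_rows[b][j] for b in LETTERS) for j, a in enumerate(LETTERS)}
    return p1, p2


T_U = critical_value(P_ROWS, U_ROWS)
p1, p2 = marginals(P_ROWS)
print(round(T_U, 2))  # 1.58
```

The same function can be applied to the relative weights of Tables 2-7 together with the distributions of Tables 2'-7'; small deviations from the tabulated values are to be expected because both inputs are rounded.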



p(a,b)    b = A    b = C    b = G    b = T    p_1(a)
a = A     0        .0706    .0080    0        .078
a = C     .0638    .1411    .2087    .0080    .422
a = G     .0147    .1951    .1411    .0706    .422
a = T     0        .0147    .0638    0        .078

Table 3': Vologodskii, 1984. $T_{\tilde{w}} = 1.61$.



p(a,b)    b = A    b = C    b = G    b = T    p_1(a)
a = A     0        .0331    .0346    0        .068
a = C     .0406    .1535    .2037    .0346    .432
a = G     .0270    .2188    .1535    .0331    .432
a = T     0        .0270    .0406    0        .068

Table 4': Blake, 1991. $T_{\tilde{w}} = 1.97$.



p(a,b)    b = A    b = C    b = G    b = T    p_1(a)
a = A     0        .0675    .0144    0        .082
a = C     .0478    .1326    .2234    .0144    .418
a = G     .0340    .1841    .1326    .0675    .418
a = T     0        .0340    .0478    0        .082

Table 5': Benight, 1992. $T_{\tilde{w}} = 1.58$.



p(a,b)    b = A    b = C    b = G    b = T    p_1(a)
a = A     0        .0608    .0095    0        .070
a = C     .0616    .1499    .2087    .0095    .430
a = G     .0087    .2102    .1499    .0608    .430
a = T     0        .0087    .0616    0        .070

Table 6': SantaLucia, 1996. $T_{\tilde{w}} = 1.55$.



p(a,b)    b = A    b = C    b = G    b = T    p_1(a)
a = A     0        .0507    .0140    0        .065
a = C     .0444    .1551    .2217    .0140    .435
a = G     .0203    .2091    .1551    .0507    .435
a = T     0        .0203    .0444    0        .065

Table 7': Sugimoto, 1996. $T_{\tilde{w}} = 1.50$.



Analysis of Table 2: The given weight sample is regular
and the maximum in (12) is attained when $p(a, b) = 0$ if
stem $(ab) \in L_6$, where the set $L_6$ of forbidden stems in the
Markov chain (13) maximizing (12) has the form

$$L_6 \triangleq \{(AT), (TA), (AA), (TT), (AG), (CT)\}. \qquad (17)$$

Below, in Table 2', we present the estimated values of joint
$p(a, b)$ and marginal $p_1(a)$ probabilities for which the maximum
in (12) is attained. The estimated value of the critical
relative distance $T_{\tilde{w}} = 2.60$ is given as well.



p(a,b)    b = A    b = C    b = G    b = T    p_1(a)
a = A     0        .0593    0        0        .059
a = C     .0466    .1427    .2515    0        .441
a = G     .0127    .2261    .1427    .0593    .441
a = T     0        .0127    .0466    0        .059

Table 2': Gotoh, 1981. $T_{\tilde{w}} = 2.60$.



Analysis of Table 8: The given weight sample $\tilde{w}$ is a
non-regular weight sample because the maximum in (12) is
attained (with the maximal value $T_{\tilde{w}} = 1.70$) for a probability
distribution $p'(a, b)$, $(ab) \in \mathcal{A}^2$, which does not satisfy
Markov condition $\mathcal{M}$ and has the form:

p'(a,b)   b = A    b = C    b = G    b = T    p'_1(a)
a = A     .0344    0        0        0        .034
a = C     0        .2190    .2466    0        .466
a = G     0        .2466    .2190    0        .466
a = T     0        0        0        .0344    .034

Table 8': Breslauer, 1986. $T_{\tilde{w}} = 1.70$.

This implies that for the weight sample $\tilde{w}$ from Table 8, we cannot
estimate the critical relative distance of optimal DNA codes
based on additive stem $\tilde{w}$-similarity.
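
Markov condition $\mathcal{M}$ depends only on which stems have positive probability, so it can be checked by a simple reachability computation. The Python sketch below is an illustration under the assumption that the supports are read off Tables 1', 2' and 8'; the function name is hypothetical. It tests whether every state can reach every state in exactly $m$ steps for some $m \in [4]$.

```python
from itertools import product

LETTERS = "ACGT"


def satisfies_condition_m(support, max_steps=4):
    """support: set of stems (a, b) with p(a, b) > 0.
    Returns True if for every pair (a, b) there is m in {1, ..., max_steps}
    such that b is reachable from a in exactly m transitions."""
    for a in LETTERS:
        reached = set()
        frontier = {a}
        for _ in range(max_steps):
            # States reachable in exactly one more step from the current frontier.
            frontier = {b for c in frontier for b in LETTERS if (c, b) in support}
            reached |= frontier
        if reached != set(LETTERS):
            return False
    return True


all_stems = set(product(LETTERS, repeat=2))
L4 = {("A", "T"), ("T", "A"), ("A", "A"), ("T", "T")}
L6 = L4 | {("A", "G"), ("C", "T")}

support_table_1 = all_stems - L4          # positive entries of Table 1'
support_table_2 = all_stems - L6          # positive entries of Table 2'
support_table_8 = {("A", "A"), ("C", "C"), ("C", "G"),
                   ("G", "C"), ("G", "G"), ("T", "T")}  # positive entries of Table 8'

print(satisfies_condition_m(support_table_1))  # True
print(satisfies_condition_m(support_table_2))  # True
print(satisfies_condition_m(support_table_8))  # False
```

For Table 8' the chain splits into the closed classes {A}, {C, G} and {T}, which is why the condition fails for that distribution.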



B. Conclusion 

For regular weight samples from Tables 2-7 (T2-T7), the
descriptive analysis and comparison of critical parameters are
summarized as follows:

                    T2      T3      T4      T5      T6      T7
L                   L_6     L_4     L_4     L_4     L_4     L_4
$T_{\tilde{w}}$     2.60    1.61    1.97    1.58    1.55    1.50

where the corresponding set $L$ ($L = L_4$ or $L = L_6$) of
forbidden stems in codewords of optimal DNA codes, for
which the critical relative distance $T_{\tilde{w}}$ can be attained, is
defined by (16) or by (17).